This report explores a dataset containing 11 chemical aspects of 4898 white wines and their quality measured by experts. There are no missing values.
## [1] 4898 13
## 'data.frame': 4898 obs. of 13 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
## $ volatile.acidity : num 0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
## $ citric.acid : num 0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
## $ residual.sugar : num 20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
## $ chlorides : num 0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
## $ free.sulfur.dioxide : num 45 14 30 47 47 30 30 45 14 28 ...
## $ total.sulfur.dioxide: num 170 132 97 186 186 97 136 170 132 129 ...
## $ density : num 1.001 0.994 0.995 0.996 0.996 ...
## $ pH : num 3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
## $ sulphates : num 0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
## $ alcohol : num 8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
## $ quality : int 6 6 6 6 6 6 6 6 6 6 ...
## X fixed.acidity volatile.acidity citric.acid
## Min. : 1 Min. : 3.800 Min. :0.0800 Min. :0.0000
## 1st Qu.:1225 1st Qu.: 6.300 1st Qu.:0.2100 1st Qu.:0.2700
## Median :2450 Median : 6.800 Median :0.2600 Median :0.3200
## Mean :2450 Mean : 6.855 Mean :0.2782 Mean :0.3342
## 3rd Qu.:3674 3rd Qu.: 7.300 3rd Qu.:0.3200 3rd Qu.:0.3900
## Max. :4898 Max. :14.200 Max. :1.1000 Max. :1.6600
## residual.sugar chlorides free.sulfur.dioxide
## Min. : 0.600 Min. :0.00900 Min. : 2.00
## 1st Qu.: 1.700 1st Qu.:0.03600 1st Qu.: 23.00
## Median : 5.200 Median :0.04300 Median : 34.00
## Mean : 6.391 Mean :0.04577 Mean : 35.31
## 3rd Qu.: 9.900 3rd Qu.:0.05000 3rd Qu.: 46.00
## Max. :65.800 Max. :0.34600 Max. :289.00
## total.sulfur.dioxide density pH sulphates
## Min. : 9.0 Min. :0.9871 Min. :2.720 Min. :0.2200
## 1st Qu.:108.0 1st Qu.:0.9917 1st Qu.:3.090 1st Qu.:0.4100
## Median :134.0 Median :0.9937 Median :3.180 Median :0.4700
## Mean :138.4 Mean :0.9940 Mean :3.188 Mean :0.4898
## 3rd Qu.:167.0 3rd Qu.:0.9961 3rd Qu.:3.280 3rd Qu.:0.5500
## Max. :440.0 Max. :1.0390 Max. :3.820 Max. :1.0800
## alcohol quality
## Min. : 8.00 Min. :3.000
## 1st Qu.: 9.50 1st Qu.:5.000
## Median :10.40 Median :6.000
## Mean :10.51 Mean :5.878
## 3rd Qu.:11.40 3rd Qu.:6.000
## Max. :14.20 Max. :9.000
Quality scores range from a minimum of 3 to a maximum of 9. However, the vast majority of wines fall in the 5 to 7 range, and the distribution looks close to normal.
However, these are ratings assigned by experts. In other words, they are ordinal values; a wine rated 9 is NOT three times better than a wine rated 3. To better analyze the data set, I decided to add a new variable that classifies each wine into one of three categories: “Low”, “Average”, and “Excellent”. A wine with a score below six (5 or less) is categorized as low quality, a wine with a score of 6 is considered average, and wines scoring 7, 8, or 9 are considered excellent.
##
## Low Average Excellent
## 1640 2198 1060
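The three-level classification described above can be sketched as follows. This is a minimal illustration in Python, not the code used to produce the report; the function name `quality_level` and the sample scores are assumptions made for the example.

```python
from collections import Counter


def quality_level(score: int) -> str:
    """Map an integer quality score to the three levels used in this report."""
    if score <= 5:
        return "Low"
    if score == 6:
        return "Average"
    return "Excellent"  # scores 7, 8, and 9


# Hypothetical sample of scores, for illustration only (not the real data).
sample_scores = [3, 5, 6, 6, 7, 9]
level_counts = Counter(quality_level(s) for s in sample_scores)
print(level_counts)
```

Applied to all 4898 quality scores, a tally like this should reproduce the counts in the table above.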
Now, here are some graphic displays of distributions of chemical characteristics.
For the variables fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, free sulfur dioxide, and total sulfur dioxide, it was clearly visible that a few outliers distort the general shapes of the distributions. Therefore, new graphs were drawn excluding values outside the 1st to 99th percentile range.
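One way to express the percentile trimming is sketched below. It assumes the 1st and 99th percentiles are computed with linear interpolation (`method="inclusive"`), which may differ slightly from the quantile rule used for the original plots. The helper name `trim_to_percentiles` and the toy values are made up for the example.

```python
import statistics


def trim_to_percentiles(values, lower=1, upper=99):
    """Keep only values inside the [lower, upper] percentile range."""
    # quantiles(n=100) returns the 99 cut points for percentiles 1..99.
    cuts = statistics.quantiles(values, n=100, method="inclusive")
    low, high = cuts[lower - 1], cuts[upper - 1]
    return [v for v in values if low <= v <= high]


# Made-up values with one extreme point at each end, for illustration only.
toy_values = [0.1] + [float(v) for v in range(10, 110)] + [500.0]
trimmed = trim_to_percentiles(toy_values)
print(len(toy_values), len(trimmed))
```

Trimming affects only what is plotted; the summary statistics in this report are computed on the full data.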
The distributions for volatile acidity, free sulfur dioxide, total sulfur dioxide, pH, sulphates and fixed acidity seemed close to normal when outliers were excluded.
However, unusually many wines shared one particular level of citric acid: 215 wines contained 0.49 g/dm^3 of citric acid. This is very peculiar.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.600 1.700 5.200 6.391 9.900 65.800
The distribution of residual sugar is also interesting. It is strongly skewed to the right. However, after a log transformation, it became visible that the distribution is bimodal. Only one wine had more than 45 g/dm^3 of residual sugar; therefore, only one wine could be considered “sweet” according to the description in the data set.
The amount of chlorides in a wine follows a distribution that closely resembles a normal curve. However, it is notable that many wines are spread fairly evenly beyond the 0.08 level.
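The log transformation can be illustrated with the first ten residual sugar values shown in the `str()` output near the top of this report. This is only a sketch: ten values cannot demonstrate the bimodality of the full data set, and the base-10 logarithm and bin width of 0.25 are assumptions chosen for the example.

```python
import math
from collections import Counter

# First ten residual sugar values (g/dm^3) from the str() output in this report.
residual_sugar = [20.7, 1.6, 6.9, 8.5, 8.5, 6.9, 7.0, 20.7, 1.6, 1.5]

# log10 compresses the long right tail so low and high clusters separate.
log_sugar = [math.log10(v) for v in residual_sugar]

# Count values per bin on the log10 scale (the bin width is an assumption).
BIN_WIDTH = 0.25
bins = Counter(math.floor(x / BIN_WIDTH) * BIN_WIDTH for x in log_sugar)
for left_edge in sorted(bins):
    print(f"[{left_edge:.2f}, {left_edge + BIN_WIDTH:.2f}): {bins[left_edge]}")

# The data set description treats wines above 45 g/dm^3 as sweet.
sweet = [v for v in residual_sugar if v > 45]
```

Even in this tiny sample the values fall into separate low and high bins on the log scale, which is the kind of gap a histogram of the full data makes visible.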
Alcohol levels in wines range from 8 to about 14. The distribution looks slightly skewed to the right.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9871 0.9917 0.9937 0.9940 0.9961 1.0390
Density of wines is very close to 1. The largest deviation above 1 is only 0.0390 (the maximum of 1.0390), and the full range from 0.9871 to 1.0390 spans just 0.0519 g/cm^3. Nonetheless, there is some variation due to the chemical compounds in the wines.
This dataset is a data frame with 4898 rows and 13 columns. It therefore holds data on 4898 white wines and 13 characteristics associated with them. Although there are 13 variables in the data set, variable X is just a row number for each wine, so there are essentially 12 variables. Of these, 11 variables such as pH and alcohol are independent variables, and 1 variable, quality, is the dependent (resulting) variable.
1 - fixed acidity (tartaric acid - g / dm^3)
2 - volatile acidity (acetic acid - g / dm^3)
3 - citric acid (g / dm^3)
4 - residual sugar (g / dm^3)
5 - chlorides (sodium chloride - g / dm^3)
6 - free sulfur dioxide (mg / dm^3)
7 - total sulfur dioxide (mg / dm^3)
8 - density (g / cm^3)
9 - pH
10 - sulphates (potassium sulphate - g / dm^3)
11 - alcohol (% by volume)
Output variable (based on sensory data):
12 - quality (score between 0 and 10)
At first, I wanted to find the variable or variables most closely related to the quality of wine. However, as I will explain throughout this report, such an attempt was futile. In any case, there were some variables that I suspected would have high correlations with quality. I will focus on residual sugar, citric acid, and chlorides, as they are chemical components that determine important aspects of taste: the sweetness, flavor, and saltiness of wine.
Intuitively, density and alcohol are variables that do not seem to affect the quality of wine. Rather, they are resulting variables that are affected by other variables such as residual sugar and sulphates. It will be interesting to see how the independent variables are actually correlated.
I created a categorical variable called “Level.” As I mentioned, because quality is an ordinal variable, using a categorical variable may help visualize which factors affect quality. Therefore, I created three categories, “Low”, “Average”, and “Excellent”, and placed wines accordingly.
For many variables, a few outliers distorted the shape of the distribution. Therefore, I chose to graph only the data within the 1st to 99th percentile range. Secondly, I log-transformed the residual sugar data. The distribution of residual sugar was highly skewed to the right, and log-transforming it revealed that the distribution is bimodal.
## fixed.acidity volatile.acidity citric.acid
## fixed.acidity 1.00 -0.02 0.29
## volatile.acidity -0.02 1.00 -0.15
## citric.acid 0.29 -0.15 1.00
## residual.sugar 0.09 0.06 0.09
## chlorides 0.02 0.07 0.11
## free.sulfur.dioxide -0.05 -0.10 0.09
## total.sulfur.dioxide 0.09 0.09 0.12
## density 0.27 0.03 0.15
## pH -0.43 -0.03 -0.16
## sulphates -0.02 -0.04 0.06
## alcohol -0.12 0.07 -0.08
## quality -0.11 -0.19 -0.01
## residual.sugar chlorides free.sulfur.dioxide
## fixed.acidity 0.09 0.02 -0.05
## volatile.acidity 0.06 0.07 -0.10
## citric.acid 0.09 0.11 0.09
## residual.sugar 1.00 0.09 0.30
## chlorides 0.09 1.00 0.10
## free.sulfur.dioxide 0.30 0.10 1.00
## total.sulfur.dioxide 0.40 0.20 0.62
## density 0.84 0.26 0.29
## pH -0.19 -0.09 0.00
## sulphates -0.03 0.02 0.06
## alcohol -0.45 -0.36 -0.25
## quality -0.10 -0.21 0.01
## total.sulfur.dioxide density pH sulphates alcohol
## fixed.acidity 0.09 0.27 -0.43 -0.02 -0.12
## volatile.acidity 0.09 0.03 -0.03 -0.04 0.07
## citric.acid 0.12 0.15 -0.16 0.06 -0.08
## residual.sugar 0.40 0.84 -0.19 -0.03 -0.45
## chlorides 0.20 0.26 -0.09 0.02 -0.36
## free.sulfur.dioxide 0.62 0.29 0.00 0.06 -0.25
## total.sulfur.dioxide 1.00 0.53 0.00 0.13 -0.45
## density 0.53 1.00 -0.09 0.07 -0.78
## pH 0.00 -0.09 1.00 0.16 0.12
## sulphates 0.13 0.07 0.16 1.00 -0.02
## alcohol -0.45 -0.78 0.12 -0.02 1.00
## quality -0.17 -0.31 0.10 0.05 0.44
## quality
## fixed.acidity -0.11
## volatile.acidity -0.19
## citric.acid -0.01
## residual.sugar -0.10
## chlorides -0.21
## free.sulfur.dioxide 0.01
## total.sulfur.dioxide -0.17
## density -0.31
## pH 0.10
## sulphates 0.05
## alcohol 0.44
## quality 1.00
A study of the correlations between all the variables reveals that, for most pairs, the correlations are actually quite low. This is especially notable for the correlations between quality and the other variables. This is what was expected: I suspected earlier that quality is not likely to be correlated with any single variable.
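For reference, each entry in the correlation matrix is a Pearson correlation coefficient. The sketch below computes one from its definition, using the first ten residual sugar and alcohol values from the `str()` output near the top of this report. Ten rows are far too few to reproduce the full-sample value of -0.45, so this only demonstrates the calculation; the function name `pearson_r` is an assumption.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# First ten rows only; this demonstrates the formula, not the real estimate.
residual_sugar = [20.7, 1.6, 6.9, 8.5, 8.5, 6.9, 7.0, 20.7, 1.6, 1.5]
alcohol = [8.8, 9.5, 10.1, 9.9, 9.9, 10.1, 9.6, 8.8, 9.5, 11.0]
r_sample = pearson_r(residual_sugar, alcohol)
print(round(r_sample, 2))
```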
When the correlations deemed insignificant (alpha = 0.01) are crossed out, 13 of the 66 pairs in the table below drop out. Most of the remaining correlations are weak, but they still provide some interesting insights. We will look at some of the higher correlations and their implications.
## row column cor p
## 25 residual.sugar density 0.8389666080 0.000000e+00
## 53 density alcohol -0.7801375389 0.000000e+00
## 21 free.sulfur.dioxide total.sulfur.dioxide 0.6155009866 0.000000e+00
## 28 total.sulfur.dioxide density 0.5298813581 0.000000e+00
## 49 residual.sugar alcohol -0.4506312311 0.000000e+00
## 52 total.sulfur.dioxide alcohol -0.4488920867 0.000000e+00
## 66 alcohol quality 0.4355747104 0.000000e+00
## 29 fixed.acidity pH -0.4258582890 0.000000e+00
## 19 residual.sugar total.sulfur.dioxide 0.4014393091 0.000000e+00
## 50 chlorides alcohol -0.3601887226 0.000000e+00
## 63 density quality -0.3071234226 0.000000e+00
## 14 residual.sugar free.sulfur.dioxide 0.2990983427 0.000000e+00
## 27 free.sulfur.dioxide density 0.2942104042 0.000000e+00
## 2 fixed.acidity citric.acid 0.2891806960 0.000000e+00
## 22 fixed.acidity density 0.2653309703 0.000000e+00
## 26 chlorides density 0.2572113872 0.000000e+00
## 51 free.sulfur.dioxide alcohol -0.2501039505 0.000000e+00
## 60 chlorides quality -0.2099344134 0.000000e+00
## 20 chlorides total.sulfur.dioxide 0.1989102960 0.000000e+00
## 57 volatile.acidity quality -0.1947229654 0.000000e+00
## 32 residual.sugar pH -0.1941334456 0.000000e+00
## 62 total.sulfur.dioxide quality -0.1747372150 0.000000e+00
## 31 citric.acid pH -0.1637482196 0.000000e+00
## 45 pH sulphates 0.1559514850 0.000000e+00
## 24 citric.acid density 0.1495025158 0.000000e+00
## 3 volatile.acidity citric.acid -0.1494718194 0.000000e+00
## 43 total.sulfur.dioxide sulphates 0.1345623732 0.000000e+00
## 54 pH alcohol 0.1214321032 0.000000e+00
## 18 citric.acid total.sulfur.dioxide 0.1211307943 0.000000e+00
## 46 fixed.acidity alcohol -0.1208811179 0.000000e+00
## 9 citric.acid chlorides 0.1143644452 8.881784e-16
## 56 fixed.acidity quality -0.1136628240 1.332268e-15
## 15 chlorides free.sulfur.dioxide 0.1013923511 1.139311e-12
## 64 pH quality 0.0994272381 3.080647e-12
## 59 residual.sugar quality -0.0975768268 7.724044e-12
## 12 volatile.acidity free.sulfur.dioxide -0.0970119387 1.019163e-11
## 6 citric.acid residual.sugar 0.0942116231 3.935585e-11
## 13 citric.acid free.sulfur.dioxide 0.0940772220 4.195155e-11
## 36 density pH -0.0935915634 5.280354e-11
## 16 fixed.acidity total.sulfur.dioxide 0.0910697579 1.711435e-10
## 33 chlorides pH -0.0904394686 2.284974e-10
## 17 volatile.acidity total.sulfur.dioxide 0.0892605036 3.902969e-10
## 4 fixed.acidity residual.sugar 0.0890206993 4.348371e-10
## 10 residual.sugar chlorides 0.0886845365 5.057195e-10
## 48 citric.acid alcohol -0.0757287294 1.119361e-07
## 44 density sulphates 0.0744931176 1.795269e-07
## 8 volatile.acidity chlorides 0.0705115721 7.824606e-07
## 47 volatile.acidity alcohol 0.0677179396 2.100319e-06
## 5 volatile.acidity residual.sugar 0.0642860606 6.712237e-06
## 39 citric.acid sulphates 0.0623309389 1.268864e-05
## 42 free.sulfur.dioxide sulphates 0.0592172444 3.369446e-05
## 65 sulphates quality 0.0536778755 1.709793e-04
## 11 fixed.acidity free.sulfur.dioxide -0.0493958555 5.437313e-04
## 38 volatile.acidity sulphates -0.0357281491 1.239761e-02
## 30 volatile.acidity pH -0.0319153704 2.550817e-02
## 23 volatile.acidity density 0.0271139368 5.776822e-02
## 40 residual.sugar sulphates -0.0266643669 6.204414e-02
## 7 fixed.acidity chlorides 0.0230856426 1.062094e-01
## 1 fixed.acidity volatile.acidity -0.0226972941 1.122218e-01
## 55 sulphates alcohol -0.0174327735 2.225307e-01
## 37 fixed.acidity sulphates -0.0171429832 2.303158e-01
## 41 chlorides sulphates 0.0167628806 2.408178e-01
## 58 citric.acid quality -0.0092090871 5.193461e-01
## 61 free.sulfur.dioxide quality 0.0081580672 5.681271e-01
## 35 total.sulfur.dioxide pH 0.0023209811 8.709954e-01
## 34 free.sulfur.dioxide pH -0.0006177853 9.655221e-01
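The p-values in the table above can be approximately reproduced from the correlation coefficient and the sample size alone. The sketch below assumes the usual t-test for a zero correlation and replaces the t distribution with a standard normal, which is reasonable for n = 4898 but is not necessarily how the table was computed. The function name `approx_p_value` is hypothetical.

```python
import math
from statistics import NormalDist


def approx_p_value(r: float, n: int) -> float:
    """Two-sided p-value for H0: correlation = 0, via a normal approximation."""
    # Test statistic t = r * sqrt((n - 2) / (1 - r^2)), with n - 2 degrees of freedom.
    t = r * math.sqrt((n - 2) / (1 - r * r))
    # For large n the t distribution is close to the standard normal.
    return 2 * (1 - NormalDist().cdf(abs(t)))


N_WINES = 4898
ALPHA = 0.01
# Correlation values copied from the table above.
p_sulphates_quality = approx_p_value(0.0536778755, N_WINES)
p_volatile_sulphates = approx_p_value(-0.0357281491, N_WINES)
print(p_sulphates_quality < ALPHA, p_volatile_sulphates < ALPHA)
```

With a sample this large, even a correlation of about 0.05 passes the 0.01 threshold, so statistical significance here says little about the strength of a relationship.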
Among the top 5 correlations, three were related to density. The correlation between density and residual sugar was 0.84, ranking number one on the list, while the correlation between density and alcohol was -0.78 and the correlation between total SO2 and density was 0.53, ranking second and fourth respectively. Considering that the third largest correlation was between free SO2 and total SO2, two variables that are obviously correlated, it can be concluded that density has notably high correlations with other variables.
The correlations are quite strong. Since sugar and SO2 are more dense than water, density should increase if a wine contains more sugar or SO2. In contrast, density should decrease if a wine has a higher alcohol level, since alcohol is less dense than water. Therefore, these correlations make sense.
Alcohol is produced by fermentation. Since the amount of sugar and the yeast activity determine fermentation, alcohol should be correlated with them. Residual sugar will be negatively correlated with the amount of sugar consumed during fermentation. Furthermore, as SO2 is an anti-microbial agent, the amount of SO2 will be negatively correlated with the degree of fermentation that has taken place. Therefore, the negative correlations between alcohol and both variables make sense.
The two variables most closely related to quality are density and alcohol. However, scatterplots do not reveal the relationship clearly. Therefore, boxplots were drawn using levels instead of quality scores.
Now, it is visible that higher quality wines tend to have more alcohol and lower density. However, as seen before, a wine with more alcohol is going to be less dense. Moreover, these two closely related variables are themselves connected with several other variables, as we have explored previously. Therefore, it can be concluded that no single variable determines the quality of wine.
Excellent wines have lower median values for residual sugar and chlorides, but the median values for citric acid do not differ greatly across the levels of wine. However, one thing that distinguishes excellent wines is that they show smaller variation in all three variables. Perhaps it is a moderate amount of everything that makes an excellent wine.
Correlations between the various variables and quality were not very strong. In fact, the strongest correlations with quality were those of density and alcohol, and even those were not strong. But when graphs were drawn using the categorical variable, level, the relationship became visible. Higher quality wines tend to have more alcohol and lower density.
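The comparison a boxplot makes, median versus spread per level, can also be computed directly. The sketch below uses made-up chloride values that are NOT taken from the data set; they are chosen only to show the calculation. The helper name `spread_by_level` and the use of the interquartile range (IQR) as the measure of variation are assumptions.

```python
import statistics
from collections import defaultdict


def spread_by_level(rows):
    """Return {level: (median, IQR)} for (level, value) pairs."""
    groups = defaultdict(list)
    for level, value in rows:
        groups[level].append(value)
    summary = {}
    for level, values in groups.items():
        q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
        summary[level] = (statistics.median(values), q3 - q1)
    return summary


# Made-up chloride values (g/dm^3), NOT taken from the data set.
toy_rows = [
    ("Low", 0.030), ("Low", 0.045), ("Low", 0.050), ("Low", 0.060), ("Low", 0.090),
    ("Excellent", 0.032), ("Excellent", 0.036), ("Excellent", 0.038),
    ("Excellent", 0.040), ("Excellent", 0.044),
]
summary = spread_by_level(toy_rows)
for level, (median, iqr) in summary.items():
    print(level, median, round(iqr, 3))
```

A smaller IQR for one level corresponds to a shorter box in the boxplot for that level.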
Further analysis revealed that median values for residual sugar, citric acid, and chlorides do not differ greatly across different levels of wine. However, great wines have much smaller variations in all three variables as shown by the box plots. This led me to a new suspicion that what makes a wine a great wine is the blending of several tastes; in other words, a moderate amount of every feature makes an excellent wine. This will be investigated further in Multivariate Analysis.
Among the top 4 correlations, three were related to density. Considering that the third largest correlation was between free SO2 and total SO2, two variables that are obviously correlated, it can be concluded that density has notably high correlations with other variables. Of course, density changes as substances of different density are added to the liquid (sugar being more dense and alcohol being less dense), so these correlations make sense.
Also, alcohol was highly correlated with residual sugar and the total amount of SO2. Since the amount of sugar and the yeast activity determine fermentation, alcohol, a product of fermentation, should be correlated with them. Residual sugar will be negatively correlated with the amount of sugar consumed during fermentation. Furthermore, as SO2 is an anti-microbial agent, the amount of SO2 will be negatively correlated with the degree of fermentation that has taken place. Therefore, the negative correlations between alcohol and both variables make sense.
The correlation between density and residual sugar was 0.84, ranking number one on the list. Since sugar is more dense than water, density should increase if a wine contains more sugar. Thus, the relationship did not deviate from what was expected.
As visible from the graphs, green dots are located around the center of the distributions when residual sugar is plotted against citric acid or chlorides. When one of the variables, whether it be residual sugar, chlorides, or citric acid, is too high, the dots are blue or red. The high concentration of green dots around the center visually shows that a nice blending of all the flavors makes a wine excellent.
The distributions for free SO2, total SO2, volatile acidity, and fixed acidity all show similar patterns. Green dots are located around the center of the distributions, both horizontally and vertically. However, the central tendency is not as strong for these four variables as it is for residual sugar, citric acid, and chlorides.
When plotting density against alcohol, it was visible that excellent wines are located in the lower right corner. This means that excellent wines have a higher alcohol concentration and lower density. However, I do not think that having a higher dose of alcohol by itself makes a wine excellent. Since most variables are negatively correlated with alcohol, having a large amount of a certain compound would mean a decrease in alcohol concentration. Because excellent wines have moderate amounts of everything, they are more likely to have a higher percentage of alcohol. And as alcohol and density are negatively correlated, excellent wines are bound to have lower density values.
My new hypothesis from the Bivariate Analysis was confirmed. Looking at plots with residual sugar as the x variable and citric acid or chlorides as the y variable, with each dot colored by level, I could see that excellent wines were located near the center of the distributions. This meant that excellent wines have moderate amounts of all chemical compounds: not too much and not too little, just the right amount. In fact, the trend was consistent with other chemical characteristics such as free SO2, total SO2, fixed acidity, and volatile acidity.
Plotting distributions with alcohol and density as variables colored by levels revealed that excellent wines tend to have more alcohol and less density. Since alcohol is negatively correlated with most variables, this makes sense.
The distribution of residual sugar was strongly skewed to the right. However, after a log transformation, it became visible that the distribution is bimodal.
The median values for citric acid did not differ greatly across the levels of wine. However, one thing that distinguishes excellent wines is that they show smaller variation in residual sugar, chlorides, and citric acid. Perhaps it is a moderate amount of everything that makes an excellent wine.
As visible from the graphs, green dots are located around the center of the distributions when residual sugar and citric acid are plotted against one another. When one of the variables, whether it be residual sugar, chlorides, or citric acid, is too high, the dots are blue or red. The high concentration of green dots around the center visually shows that a nice blending of all the flavors makes a wine excellent.
Frankly, I do not drink wine very often and personally cannot distinguish a “good” wine from a “bad” one. I chose this data set because I was curious how to differentiate a good wine from a bad one. Some wines are priced at thousands of dollars! What makes such wines so special? Could I make a model that predicts the price of a wine based on its chemical characteristics? What is the most important factor that determines the quality of wine? Questions like these were the beginning of this research.
However, when I calculated correlation values for the data in the exploratory phase, I ran into trouble. As shown above, the correlations were very low for most variables. If a relationship has a correlation whose absolute value is less than 0.3, then it is usually considered “no relationship.” By that standard, only two variables could be considered to have a relationship with quality. Moreover, those two variables are alcohol and density, which themselves are dependent variables!
At first, I suspected that the quality of wine is not determined by the wine’s chemical composition. It may be the brand, the appearance of the bottle, or the sommelier’s mood at the time that determines the wine’s quality. Then I realized that the quality variable is an ordinal variable; a wine with quality 9 is not three times better than a wine with quality 3. I tried to make a model that predicts the quality value as we did for the diamond data set, and only then did I realize that the price variable and the quality variable are innately different. So I constructed a categorical variable, “levels”, and looked for trends. Setting up the level variable was the key decision that made this research possible.
Several patterns emerged after that. Contrary to what I expected, the median values of many variables were not that different for wines of different levels. However, the ranges of the variables were visibly smaller for excellent wines. Indeed, fervid wine drinkers often look for completeness when they drink wine. In other words, they want wines with a variety of tastes blending together. My bivariate and multivariate analyses support the idea that a good wine is a wine that has a moderate amount of everything.
Lastly, it would be desirable to include a price variable in future research. Exploring the relationship between the price of wine and its quality, and between the price of wine and its chemical composition, would be very interesting. Furthermore, I want to conduct a comparative study with the red wine data. I suspect that sommeliers look for a different blending of tastes in red wine than in white wine, and I want to see if that is true.
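The 0.3 rule of thumb can be applied mechanically to the quality column of the correlation matrix shown earlier. The sketch below copies those rounded values from the matrix; the cutoff itself is the informal convention mentioned above, not a formal test.

```python
# Correlation of each variable with quality, copied from the matrix in this report.
quality_correlations = {
    "fixed.acidity": -0.11,
    "volatile.acidity": -0.19,
    "citric.acid": -0.01,
    "residual.sugar": -0.10,
    "chlorides": -0.21,
    "free.sulfur.dioxide": 0.01,
    "total.sulfur.dioxide": -0.17,
    "density": -0.31,
    "pH": 0.10,
    "sulphates": 0.05,
    "alcohol": 0.44,
}

THRESHOLD = 0.3  # informal cutoff below which a correlation is treated as "no relationship"

# Keep variables at or above the cutoff, strongest first.
related = sorted(
    (name for name, r in quality_correlations.items() if abs(r) >= THRESHOLD),
    key=lambda name: abs(quality_correlations[name]),
    reverse=True,
)
print(related)
```

Only alcohol and density pass the cutoff, and density does so only narrowly.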